William Gann's Hyperparameter Tuning of Feature Engineering Pipelines using GridSearchCV
Hyperparameter tuning is the process of searching for the hyperparameter values that maximize a model's performance. In the context of feature engineering pipelines, this includes tuning the parameters of the transformers themselves, such as the window size of a moving average or the number of components in a PCA.
scikit-learn's GridSearchCV provides a simple and effective way to perform this search.
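For a transformer's parameters to be tunable, the transformer must expose them through scikit-learn's estimator interface (`get_params`/`set_params`, which `BaseEstimator` provides automatically from `__init__` arguments). As a minimal sketch, here is a hypothetical moving-average transformer of the kind assumed throughout this section; the class name, the trailing-mean implementation, and the warm-up fill are all illustrative choices, not part of any library:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MovingAverageTransformer(BaseEstimator, TransformerMixin):
    """Append short- and long-window trailing moving averages of the
    first column as extra feature columns (illustrative sketch)."""

    def __init__(self, short_window=20, long_window=200):
        # Stored under the same names as the __init__ arguments, so
        # GridSearchCV can read and set them via get_params/set_params.
        self.short_window = short_window
        self.long_window = long_window

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        X = np.asarray(X, dtype=float)

        def trailing_mean(a, w):
            # Trailing mean over a window of w observations; the first
            # w-1 entries have no full window, so we crudely fill them
            # with the series' first value (an assumption, not a rule).
            ma = np.convolve(a, np.ones(w) / w, mode="full")[: len(a)]
            ma[: w - 1] = a[0]
            return ma

        short = trailing_mean(X[:, 0], self.short_window)
        long_ = trailing_mean(X[:, 0], self.long_window)
        return np.column_stack([X, short, long_])
```

Because `short_window` and `long_window` are declared in `__init__`, a grid search can address them as `moving_average__short_window` once the transformer is placed in a pipeline under the step name `moving_average`.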
GridSearchCV and RandomizedSearchCV
GridSearchCV performs an exhaustive search over every combination in a specified parameter grid. RandomizedSearchCV instead samples a fixed number of candidates from specified parameter distributions. RandomizedSearchCV is often preferred for large parameter spaces, since evaluating a random sample of combinations is far cheaper than enumerating all of them.
```python
from sklearn.model_selection import GridSearchCV

# Assume `pipeline` is our feature engineering pipeline with a step
# named "moving_average"; parameters are addressed as "<step>__<param>".
parameters = {
    'moving_average__short_window': [10, 20, 50],
    'moving_average__long_window': [100, 200, 300],
}

grid_search = GridSearchCV(pipeline, parameters, cv=5)
grid_search.fit(X, y)  # X, y: the feature matrix and target series
```
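A RandomizedSearchCV version works the same way, except that the grid is replaced by distributions to sample from. The sketch below is self-contained: the PCA-plus-Ridge pipeline and the synthetic data are illustrative stand-ins for the feature engineering pipeline above, not part of it.

```python
import numpy as np
from scipy.stats import randint
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Stand-in pipeline: PCA (a transformer mentioned above) plus a Ridge model.
pipeline = Pipeline([("pca", PCA()), ("model", Ridge())])

# Distributions instead of a grid: randint draws integers, lists are
# sampled uniformly. Only n_iter candidates are evaluated in total.
param_distributions = {
    "pca__n_components": randint(1, 6),       # integers in [1, 5]
    "model__alpha": [0.01, 0.1, 1.0, 10.0],
}

# Synthetic data purely for the demonstration.
rng = np.random.RandomState(0)
X = rng.randn(100, 8)
y = X[:, 0] + 0.1 * rng.randn(100)

search = RandomizedSearchCV(pipeline, param_distributions,
                            n_iter=10, cv=5, random_state=0)
search.fit(X, y)
```

With `n_iter=10`, only ten parameter combinations are evaluated, regardless of how large the joint space is; the exhaustive grid here would already contain 20 combinations, and real feature pipelines grow much faster.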
Cross-Validation Strategies for Financial Time Series
Standard cross-validation techniques, such as k-fold cross-validation, are not suitable for financial time series data: shuffled folds let the model train on observations that occur after those it is tested on, a form of data leakage (look-ahead bias). Instead, we need cross-validation strategies designed for time series, such as TimeSeriesSplit, which always places the test window after the training window.
```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
grid_search = GridSearchCV(pipeline, parameters, cv=tscv)
```
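To see why TimeSeriesSplit avoids leakage, it helps to print the indices it actually generates on a small toy series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Every training window ends before its test window begins,
    # so no future observation leaks into training.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows with each fold while the test window slides forward, mimicking how a model would be re-fit and evaluated on successive periods in live trading.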
Mathematical Formulation: Cross-Validation
The goal of cross-validation is to estimate the generalization error of a model: the data is split into $k$ folds, the model is evaluated on each held-out fold in turn, and the average of the fold errors is used as the estimate.
$$E_{cv} = \frac{1}{k} \sum_{i=1}^{k} E_i$$
Where:
- $E_{cv}$ is the cross-validation error.
- $k$ is the number of folds.
- $E_i$ is the error on the $i$-th fold.
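In scikit-learn, this average is obtained by collecting the per-fold scores from `cross_val_score` and taking their mean. A minimal sketch on synthetic data (the Ridge model and the data-generating coefficients are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic regression data with a known linear signal (illustrative).
rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.randn(60)

# One score E_i per fold, using the time-series-aware splitter.
fold_scores = cross_val_score(Ridge(), X, y, cv=TimeSeriesSplit(n_splits=5))

# E_cv from the formula above: the mean over the k folds.
e_cv = fold_scores.mean()
```

Note that `cross_val_score` reports a *score* (R² for regressors by default) rather than an error, so higher is better; the averaging in the formula is the same either way.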
After fitting, the best parameters and cross-validated score might look like:

| Result | Value |
|---|---|
| `short_window` | 20 |
| `long_window` | 200 |
| CV score | 0.75 |
By tuning the hyperparameters of your feature engineering pipeline, you can significantly improve the performance of your trading models.
